Data-Driven Discovery of Protein Function Classifiers: Decision Trees Based on MEME Motifs Outperform PROSITE Patterns and Profiles on Peptidase Families
Authors
Abstract
This paper describes an approach to data-driven discovery of decision trees or rules for assigning protein sequences to functional families using sequence motifs. This method is able to capture regularities that can be described in terms of presence or absence of arbitrary combinations of motifs. A training set of peptidase sequences labeled with the corresponding MEROPS functional families or clans is used to automatically construct decision trees that capture regularities that are sufficient to assign the sequences to their respective functional families. The performance of the resulting decision tree classifiers is then evaluated on an independent test set. We compared the rules constructed using motifs generated by a multiple sequence alignment based motif discovery tool (MEME) with rules constructed using expert annotated PROSITE motifs (patterns and profiles). Our results indicate that the former provide a potentially powerful high throughput technique for constructing protein function classifiers when adequate training data are available. Examination of the generated rules in the case of a Caspase (C14) family suggests that the proposed technique may be able to identify combinations of sequence motifs that characterize functionally significant 3-dimensional structural features of proteins.
1. BACKGROUND AND INTRODUCTION
Assigning putative functions to protein sequences remains one of the most challenging problems in functional genomics. The function of a protein depends to a large extent on its 3-dimensional structure; the shape of the protein both constrains and facilitates the ways in which the protein can interact with other proteins. Proteins with similar 3-dimensional structural features very often, but not always, have similar functions. However, experimental determination of protein structures using NMR or X-ray crystallography techniques is time consuming and expensive.
While there are 254,293 protein records in the PIR-PSD database [Release 70.01, Oct-2001] [Baker et al., 2001], there are only 14,339 experimentally determined 3-dimensional protein structures in the Protein Data Bank (PDB) [version 23-Oct-2001] [Berman et al., 2000], corresponding to approximately 3000 different proteins. Hence, protein function prediction often relies on protein structure prediction using computational approaches. Ab initio methods that predict the conformation of a protein from its amino acid sequence are computationally very demanding and are currently limited to relatively short proteins or peptides [Samudrala et al., 1999].
Early work on protein pattern recognition [Dayhoff et al., 1983] suggested that short sequences of amino acids (motifs) may be conserved in a protein family. Currently, motif composition is often used to assign putative functions to novel protein sequences based on the known functions of other proteins that share one or more motifs with the novel protein. Several kinds of databases have been developed: databases of motifs, e.g., PROSITE [Hofmann et al., 1999]; databases of groups of motifs referred to as fingerprints or blocks, e.g., PRINTS [Attwood et al., 2000]; databases of profiles, i.e., sequence patterns often based on weight matrices or hidden Markov models generated from multiple sequence alignments, e.g., PROSITE [Hofmann et al., 1999]; and databases of domains, e.g., Pfam [Bateman et al., 2000]. Such motif databases, or resources that integrate them, e.g., InterPro [Apweiler et al., 2001] and MetaFam [Silverstein et al., 2001], can be queried using a protein sequence to obtain a list of motifs that are found in the sequence as well as the functions or structures associated with these motifs. Motif-based techniques for protein function prediction focus similarity searches on parts of the protein that are likely to be functionally or structurally significant, and hence more likely to be conserved.
Current motif-based approaches to protein function prediction are not without drawbacks. Many proteins contain several motifs, and the same motif may be found in proteins belonging to several different functional families. More generally, it may be necessary to identify combinations of motifs that must be present, or perhaps even absent, in a sequence in order to reliably assign it to a functional family. Indeed, in the PRINTS database [Attwood et al., 2000], the fingerprints used to assign proteins to functional families can be simple motifs or a combination of motifs. However, the process of identifying a fingerprint for each protein family of interest can be labor intensive and requires considerable domain knowledge. Thus, there is a need for sophisticated tools that automate the discovery of sequence regularities predictive of protein function and allow efficient updating of databases.
In this paper, we test the feasibility of a fully automated approach for protein function classification. We present a data-driven approach to discovery of rules for assigning protein sequences to functional families on the basis of the presence or absence of specific motifs or combinations of motifs. (For simplicity, we will use the term motif to include short conserved sequence patterns as well as profiles.) Machine learning algorithms [Mitchell, 1997] offer one of the most cost-effective approaches to automated discovery of a priori unknown predictive relationships from large data sets. Decision tree induction algorithms are relatively fast, and produce rules that are easy to interpret (and hence understandable by humans).
1 This research was supported in part by grants from the National Science Foundation (9982341, 9972653), the Carver Foundation, and Pioneer Hi-Bred, Inc. This research has benefited from interactions with Dr. Dake Wang, Zhong Gao, Changhui Yan, and Carson Andorf of the Iowa State University Artificial Intelligence Research Laboratory.
Machine learning approaches have previously been used for protein function classification. For example, King et al. [2001] investigated an inductive logic programming approach to the construction of protein function classifiers using alternative representations of protein sequences (amino acid residue frequencies, phylogeny, and predicted structure). In a previous study, we used the C4.5 family of decision tree induction algorithms [Quinlan, 1992] to discover rules for protein classification on the basis of presence or absence of combinations of PROSITE motifs, with encouraging results [Wang et al., 2001]. The study demonstrated, for several protein families, that decision tree classifiers generated using PROSITE patterns and motifs can provide more accurate protein family classification than the use of a single characteristic motif. PROSITE patterns are usually fairly short (less than 20 amino acids) and typically correspond to biologically significant sites experimentally identified in PROSITE functional families. PROSITE profiles, on the other hand, correspond to hidden Markov models that usually match longer sequence fragments (often over 100 amino acids). These longer profiles are useful as "signatures" for protein families, but make it difficult to identify underlying sequence regularities that are predictive of protein function or that may correspond to biologically significant structural features. Here we explore whether it is possible to use relatively short, automatically generated motifs to discover rules for protein classification. A variety of automated approaches have been developed for identification of motifs (see [Hudak and McClure, 1999] for a comparison of several such motif detection methods).
In this study, we used MEME (Multiple Expectation Maximization for Motif Elicitation) [Bailey et al., 1999], a multiple sequence alignment based motif discovery program which can be used to automate the construction of motif databases from any given set of sequences. For our data set, we chose a well-characterized subset of protein families from the MEROPS protease database [Release 5.4 23-Mar-2000] [Rawlings et al., 2000]. We compared rules discovered based on motifs automatically generated using MEME with those generated based on PROSITE patterns and profiles [Hofmann et al., 1999]. Further, we investigated the ability of decision trees to discover functionally significant structural features of proteins using the caspase protease family as a test case.
2. DATA DRIVEN DISCOVERY OF RULES FOR PROTEIN FUNCTION CLASSIFICATION USING SEQUENCE MOTIFS
The basic computational problem is the following: given a database or training set of amino acid sequences corresponding to proteins with known (i.e., experimentally determined) function, our goal is to induce a classifier that would be able to assign novel protein sequences to one of the protein families represented in the training set. The general approach is illustrated in Figure 1.
Data Representation
The first step in this process is the preparation of a data set. A majority of algorithms for data-driven induction of pattern classifiers represent instances to be classified using a fixed set of attributes. Hence, we first map each protein sequence into a corresponding attribute-based representation [Wang et al., 2001]. The choice of attributes plays a critical role in the data mining process. We represent protein sequences using a suitable vocabulary of sequence motifs. The set of motifs to be used can be chosen to correspond to one of the existing motif databases (e.g., PROSITE) or the set of motifs identified by running a suitable motif-finding program (e.g., MEME) on the set of protein sequences.
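Whatever vocabulary is chosen, the representation rests on one primitive: deciding whether a given motif occurs in a given sequence. For PROSITE-style patterns, one common way to do this is to translate the pattern into a regular expression. The sketch below is illustrative only and is not the procedure used in this study: `prosite_to_regex` and `has_motif` are hypothetical helper names, the example pattern is made up rather than taken from PROSITE, and only a subset of the pattern syntax is handled. Profiles and MEME motifs would need a score-based matcher instead.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles a subset of the syntax: 'x' (any residue), [ABC] (alternatives),
    {ABC} (excluded residues), (n) and (n,m) repetition, and the '<' / '>'
    terminal anchors at the start or end of an element.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def has_motif(pattern: str, sequence: str) -> bool:
    """Return True if the pattern matches anywhere in the sequence."""
    return re.search(prosite_to_regex(pattern), sequence) is not None


# Made-up pattern and sequences, for illustration only.
EXAMPLE_PATTERN = "C-x(2,4)-C-[LIVM]-{P}-H"
print(prosite_to_regex(EXAMPLE_PATTERN))
print(has_motif(EXAMPLE_PATTERN, "AAACAAACLGHAA"))
print(has_motif(EXAMPLE_PATTERN, "AAACAAACLPHAA"))
```

Each call to `has_motif` yields one presence/absence attribute for one sequence and one motif.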
Suppose the vocabulary contains N motifs. Any given sequence typically contains a few of these motifs. We encode each sequence as an N-bit binary pattern where the ith bit is 1 if the corresponding motif is present in the sequence; otherwise the corresponding bit is 0. Each N-bit pattern is associated with a label which identifies the functional family of the sequence (if known). A training set is simply a collection of N-bit binary patterns, each of which has associated with it a label that identifies the functional family of the corresponding protein. This training set can be used to train a classifier which can then be used to assign novel sequences to one of the several functional families represented in the training set. This process is illustrated in Figure 1.
Data Set Used
A subset of the peptidase (protease) families classified according to the MEROPS two-level classification system [Rawlings and Barrett, 1993] was used in this study. The MEROPS database (http://www.merops.co.uk/) classifies proteases into functional families and clans. Clans are groupings of related functional families. The choice of the peptidase families was motivated by the diversity of the proteins in the family and the fact that many of them are well-characterized and have known structures and functions [Barrett et al., 1998]. The fact that the peptidases have a two-level classification structure allows for the analysis of the performance of the rule sets on the two levels (clans and families). The data set used in this study can be found at http://www.cs.iastate.edu/~xyunwang/data. For this study, all MEROPS-defined protease families that had more than 2 protein members and belonged to a clan were chosen. Clans with fewer than two member proteins were excluded from the data set. Protein sequences that were only fragments were then removed, leaving 1933 proteins. The resulting 84 families had between 3 and 313 members.
A total of 19 clans were used, each with between 1 and 18 families. In order to avoid excessive bias in favor of large families (i.e., those consisting of a large number of members), only 50-100 randomly chosen proteins from large families were selected, resulting in a data set of 1627 proteins. MEME motifs were extracted from these 84 families of proteins.
[Figure 1: diagram of the general approach. Box labels: Data Set of Proteins with known function; Motif-based Representation of Sequences; Training Set; Test Set for Classifier Evaluation; Learning algorithm; Motif-based Representation of a Novel Protein; Protein Classification.]
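The representation and learning steps of Section 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the motif names (`m1`, `m2`, `m3`), family labels, and training examples are invented, and the tree is grown with a plain information-gain criterion (ID3-style) as a simplified stand-in for C4.5, which additionally uses gain ratio and pruning.

```python
import math
from collections import Counter

VOCAB = ["m1", "m2", "m3"]  # hypothetical motif vocabulary (N = 3)


def encode(motifs_found, vocabulary=VOCAB):
    """N-bit pattern: bit i is 1 if the i-th motif occurs in the sequence."""
    return tuple(1 if m in motifs_found else 0 for m in vocabulary)


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Reduction in label entropy obtained by splitting on one motif bit."""
    remainder = 0.0
    for value in (0, 1):
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        if subset:
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    """Return a family label (leaf) or a (motif_index, branches) node."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    rest = [a for a in attrs if a != best]
    branches = {}
    for value in (0, 1):
        pairs = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        if pairs:
            sub_rows, sub_labels = zip(*pairs)
            branches[value] = build_tree(list(sub_rows), list(sub_labels), rest)
        else:
            branches[value] = majority
    return (best, branches)


def classify(tree, row):
    """Follow presence/absence tests from the root down to a family label."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree


# Invented training set: (motifs found in the sequence, family label).
TRAINING = [
    ({"m1", "m2"}, "famA"), ({"m1", "m2", "m3"}, "famA"),
    ({"m1"}, "famB"), ({"m1", "m3"}, "famB"),
    ({"m2"}, "famC"), ({"m3"}, "famC"),
]
rows = [encode(motifs) for motifs, _ in TRAINING]
labels = [family for _, family in TRAINING]
tree = build_tree(rows, labels, list(range(len(VOCAB))))

print(tree)
print(classify(tree, encode({"m2", "m3"})))  # a pattern not seen in training
```

In this toy set no single motif identifies `famA` or `famB`; the induced tree separates them by testing `m1` first and then `m2`, which is the kind of presence/absence combination the approach is designed to capture.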
Similar resources
Automated data-driven discovery of motif-based protein function classifiers
Xiangyun Wang, Diane Schroeder, Drena Dobbs, and Vasant Honavar. Artificial Intelligence Laboratory, Department of Computer Science and Graduate Program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA 50011, USA. www.cs.iastate.edu/~honavar/aigroup.html ABSTRACT This paper ...
Discovering Protein Function Classification Rules from Reduced Alphabet Representations of Protein Sequences
The paper explores the use of reduced alphabet representations of protein sequences in the data-driven discovery of sequence motif-based decision trees for classifying protein sequences into functional families. A number of alternative representations of protein sequences (using a variety of reduced alphabets based on groupings of amino acids in terms of their physico -...
The Value of Prior Knowledge in Discovering Motifs with MEME
MEME is a tool for discovering motifs in sets of protein or DNA sequences. This paper describes several extensions to MEME which increase its ability to find motifs in a totally unsupervised fashion, but which also allow it to benefit when prior knowledge is available. When no background knowledge is asserted, MEME obtains increased robustness from a method for determining motif widths automati...
Data-Driven Generation of Decision Trees for Motif-Based Assignment of Protein Sequences to Functional Families
This paper describes an approach to data-driven discovery of sequence motif-based models in the form of decision trees for assigning protein sequences to functional families. Unlike approaches that try to classify protein sequences based on presence of a single motif, this method is able to capture regularities that can be described in terms of presence or absence of arbitrary combinations of m...
iProsite: an improved prosite database achieved by replacing ambiguous positions with more informative representations
The PROSITE database contains a set of entries corresponding to protein families, which are used to identify the family of a protein from its sequence. Although patterns and profiles are developed to be very selective, each may have false positive or false negative hits. Considering false positives as items that reduce the selectiveness of a pattern, then, the more selective pattern we have, a more accur...